Down-sampling speech representation in ASR
Authors
Abstract
Features for automatic speech recognition (ASR) are typically sampled at about 100 Hz (10 ms analysis step). Recent experiments indicate that the most efficient components of the modulation spectrum of speech for ASR extend up to about 16 Hz [1]. Consequently, RASTA processing attenuates modulation frequencies higher than 16 Hz and should in principle allow for a subsequent down-sampling of the features. It has been shown earlier that in a Gaussian mixture model based speaker recognition system (which uses a single-state HMM and thus requires no time alignment of the incoming speech) one can down-sample the speech representation after RASTA filtering without any significant loss of performance [2]. However, since ASR uses Viterbi time alignment, the reduced number of time samples due to down-sampling, although justified by the Nyquist criterion after the low-pass filtering, could create problems. In this paper we show experimentally that down-sampling of features after RASTA filtering is feasible and could result in considerable computational, or at least storage and transmission, savings.
1 Temporal processing
Speech contains many sources of information, such as information about the linguistic message, about the speaker of the message, and about the communication channel used for recording and transmitting the speech signal. For a given task, it is helpful to retain the relevant sources of information in the extracted features while suppressing the irrelevant ones. In ASR, the task is to decode the linguistic message. This linguistic message is coded in the movements of the vocal tract, and the speech signal reflects these movements. The rate of change of the non-linguistic components in speech often lies outside the typical rate of change of the vocal tract shape. The RASTA [3] and LDA [4] techniques take advantage of this fact and bandpass filter the time trajectories of speech feature vectors.
1.1 Exploiting the bandpass property
RASTA filters out the fast (and slow) changes of spectral components over time. Since fast changes (high modulation frequency components) are eliminated by RASTA filtering, the Nyquist criterion suggests that the RASTA-filtered features could be sampled at a lower rate than the original unfiltered features (Fig. 2).
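The Nyquist argument can be illustrated with a minimal sketch. It assumes the 100 Hz frame rate and the 16 Hz modulation cutoff quoted in the abstract, and it uses a generic Hamming-windowed sinc low-pass filter as a stand-in for RASTA filtering; it is not the filter used in the paper. All function names and the 31-tap filter length are hypothetical choices made for illustration.

```python
import math


def max_decimation_factor(frame_rate_hz, cutoff_hz):
    """Largest integer M such that frame_rate / M is still >= 2 * cutoff."""
    return max(1, int(frame_rate_hz // (2.0 * cutoff_hz)))


def lowpass_taps(cutoff_hz, frame_rate_hz, num_taps=31):
    """Hamming-windowed sinc low-pass FIR (assumed stand-in, not RASTA)."""
    fc = cutoff_hz / frame_rate_hz            # cutoff in cycles per frame
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            sinc = 2.0 * fc
        else:
            sinc = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]          # normalise to unit DC gain


def filter_trajectory(x, taps):
    """Filter one feature trajectory over time; output has the same length.

    The filter is centred on each frame, and the first/last values are
    held constant beyond the ends of the trajectory.
    """
    half = len(taps) // 2
    last = len(x) - 1
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, h in enumerate(taps):
            idx = min(max(t + half - j, 0), last)
            acc += h * x[idx]
        out.append(acc)
    return out


def downsample_features(frames, frame_rate_hz=100.0, cutoff_hz=16.0):
    """Low-pass each feature dimension over time, then keep every M-th frame.

    frames: list of equal-length feature vectors, one per analysis step.
    Returns (down-sampled frames, decimation factor M).
    """
    factor = max_decimation_factor(frame_rate_hz, cutoff_hz)
    if not frames:
        return [], factor
    taps = lowpass_taps(cutoff_hz, frame_rate_hz)
    dims = len(frames[0])
    trajectories = [
        filter_trajectory([frame[d] for frame in frames], taps)
        for d in range(dims)
    ]
    kept = range(0, len(frames), factor)
    return [[trajectories[d][t] for d in range(dims)] for t in kept], factor
```

With these assumed values the factor is 3: the features go from 100 frames per second to about 33.3, whose Nyquist frequency (about 16.7 Hz) still lies just above the 16 Hz cutoff. A real filter has a finite transition band, so a practical system would likely leave more margin than this sketch does.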
Similar resources
Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features
Speech enhancement is an important front-end technique to improve automatic speech recognition (ASR) in noisy environments. However, the wrong noise suppression of speech enhancement often causes additional distortions in speech signals, which degrades the ASR performance. To compensate the distortions, ASR needs to consider the uncertainty of enhanced features, which can be achieved by using t...
Informing multisource decoding in robust automatic speech recognition
Listeners are remarkably adept at recognising speech in natural multisource environments, while most Automatic Speech Recognition (ASR) technology fails in these conditions. It has been proposed that this human ability is governed by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated into perceptual packages, called ‘streams’, by a combination of bottom-up and top-d...
Chapter 8: Acoustic Features and Distance Measure to Reduce Vulnerability of ASR Performance Due to the Presence of a Communication Channel and/or Background Noise
Saying that late 20th century automatic speech recognition (ASR) is pattern recognition, is something of a truism, but perhaps one of which the fundamental implications are not always fully appreciated. Essentially, a pattern recognition task boils down to measuring the distance between a physical representation of a new, as yet unknown token, and all elements of a set of pre-existing patterns,...
Evidence against Frame-based Analysis Techniques
The need of ∆, ∆∆, ∆∆∆, ∆∆∆∆.... measures is a clear sign of the loss in the representation capability of classical frame-based analysis techniques. Mainly coarticulation effects in fluent speech are hidden and obscured by the classical short-time analysis technique. In fact, almost every acceptable ASR system is forced to introduce this kind of post-processing technique, in order to obviate to...
Speech Representation Learning Using Unsupervised Data-Driven Modulation Filtering for Robust ASR
The performance of an automatic speech recognition (ASR) system degrades severely in noisy and reverberant environments in part due to the lack of robustness in the underlying representations used in the ASR system. On the other hand, the auditory processing studies have shown the importance of modulation filtered spectrogram representations in robust human speech recognition. Inspired by these...